{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"font-family: Arial; font-size:2.75em;color:purple; font-style:bold\"><br>\n",
    "\n",
    "Regresssion with scikit-learn<br><br> using Soccer Dataset\n",
    "\n",
    "<br><br></p>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will again be using the open dataset from the popular site <a href=\"https://www.kaggle.com\">Kaggle</a> that we used in Week 1 for our example. \n",
    "\n",
    "Recall that this <a href=\"https://www.kaggle.com/hugomathien/soccer\">European Soccer Database</a> has more than 25,000 matches and more than 10,000 players for European professional soccer seasons from 2008 to 2016. \n",
    "\n",
    "**Note:** Please download the file *database.sqlite* if you don't yet have it in your *Week-7-MachineLearning* folder."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"font-family: Arial; font-size:1.75em;color:purple; font-style:bold\"><br>\n",
    "\n",
    "Import Libraries<br><br></p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import sqlite3\n",
    "import pandas as pd \n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from math import sqrt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"font-family: Arial; font-size:1.75em;color:purple; font-style:bold\"><br>\n",
    "\n",
    "Read Data from the Database into pandas\n",
    "<br><br></p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Create your connection.\n",
    "cnx = sqlite3.connect('database.sqlite')\n",
    "df = pd.read_sql_query(\"SELECT * FROM Player_Attributes\", cnx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"font-family: Arial; font-size:1.75em;color:purple; font-style:bold\"><br>\n",
    "\n",
    "Declare the Columns You Want to Use as Features\n",
    "<br><br></p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "features = [\n",
    "       'potential', 'crossing', 'finishing', 'heading_accuracy',\n",
    "       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',\n",
    "       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',\n",
    "       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',\n",
    "       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',\n",
    "       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',\n",
    "       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',\n",
    "       'gk_reflexes']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"font-family: Arial; font-size:1.75em;color:purple; font-style:bold\"><br>\n",
    "\n",
    "Specify the Prediction Target\n",
    "<br><br></p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "target = ['overall_rating']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"font-family: Arial; font-size:1.75em;color:purple; font-style:bold\"><br>\n",
    "\n",
    "Clean the Data<br><br></p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df = df.dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"font-family: Arial; font-size:1.75em;color:purple; font-style:bold\"><br>\n",
    "\n",
    "Extract Features and Target ('overall_rating') Values into Separate Dataframes\n",
    "<br><br></p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X = df[features]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "y = df[target]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us look at a typical row from our features: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X.iloc[2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us also display our target values: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"font-family: Arial; font-size:1.75em;color:purple; font-style:bold\"><br>\n",
    "\n",
    "Split the Dataset into Training and Test Datasets\n",
    "<br><br></p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"font-family: Arial; font-size:1.75em;color:purple; font-style:bold\"><br>\n",
    "\n",
    "(1) Linear Regression: Fit a model to the training set\n",
    "<br><br></p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "regressor = LinearRegression()\n",
    "regressor.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"font-family: Arial; font-size:1.75em;color:purple; font-style:bold\"><br>\n",
    "\n",
    "Perform Prediction using Linear Regression Model\n",
    "<br><br></p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y_prediction = regressor.predict(X_test)\n",
    "y_prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"font-family: Arial; font-size:1.75em;color:purple; font-style:bold\"><br>\n",
    "\n",
    "What is the mean of the expected target value in test set ?\n",
    "<br><br></p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y_test.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"font-family: Arial; font-size:1.75em;color:purple; font-style:bold\"><br>\n",
    "\n",
    "Evaluate Linear Regression Accuracy using Root Mean Square Error\n",
    "\n",
    "<br><br></p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "RMSE = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(RMSE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"font-family: Arial; font-size:1.75em;color:purple; font-style:bold\"><br>\n",
    "\n",
    "(2) Decision Tree Regressor: Fit a new regression model to the training set\n",
    "<br><br></p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "regressor = DecisionTreeRegressor(max_depth=20)\n",
    "regressor.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"font-family: Arial; font-size:1.75em;color:purple; font-style:bold\"><br>\n",
    "\n",
    "Perform Prediction using Decision Tree Regressor\n",
    "<br><br></p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y_prediction = regressor.predict(X_test)\n",
    "y_prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"font-family: Arial; font-size:1.75em;color:purple; font-style:bold\"><br>\n",
    "\n",
    "For comparision: What is the mean of the expected target value in test set ?\n",
    "<br><br></p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y_test.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"font-family: Arial; font-size:1.75em;color:purple; font-style:bold\"><br>\n",
    "\n",
    "Evaluate Decision Tree Regression Accuracy using Root Mean Square Error\n",
    "\n",
    "<br><br></p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "RMSE = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(RMSE)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
